Data fusion - Resolving Data Conflicts for Integration

Authors

Xin Dong

Felix Naumann

Abstract

The amount of information produced in the world increases by 30% every year, and this rate will only go up. With advanced network technology, more and more sources are available, either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have driven fruitful research on data integration over the last two decades [13, 15]. Data integration systems face two kinds of challenges. First, data from disparate sources are often heterogeneous. Heterogeneity can exist at the schema level, where different data sources often describe the same domain using different schemas; it can also exist at the instance level, where different sources can represent the same real-world entity in different ways. There is a rich body of work on resolving heterogeneity in data, including, at the schema level, schema mapping and matching [17], model management [1], answering queries using views [14], and data exchange [10], and, at the instance level, record linkage (a.k.a. entity resolution, object matching, reference linkage, etc.) [9, 18] and string similarity comparison [6]. Second, different sources can provide conflicting data. Conflicts can arise because of incomplete, erroneous, or out-of-date data. Returning incorrect data in a query result can be misleading and even harmful: one may contact a person at an out-of-date phone number, visit a clinic at the wrong address, carry wrong knowledge of the real world, or even make poor business decisions. It is thus critical for data integration systems to resolve conflicts among sources and distinguish true values from false ones.
This problem has become especially prominent given the ease of publishing and spreading false information on the Web, and it has recently received increasing attention. This tutorial focuses on data fusion, which addresses the second challenge by fusing records on the same real-world ...

Similar resources

A Relational Operator Approach to Data Fusion

Integrated information systems provide users with a single unified view to heterogeneous data sources. As the resolution of schema level conflicts and the detection of fuzzy duplicates has been looked at more comprehensively, the problem of resolving data level conflicts still remains. We propose a relational data fusion operator, which fuses tuples representing the same real world entity by re...

Eliminating NULLs with Subsumption and Complementation

In a data integration process, an important step after schema matching and duplicate detection is data fusion. It is concerned with the combination or merging of different representations of one real-world object into a single, consistent representation. In order to solve potential data conflicts, many different conflict resolution strategies can be applied. In particular, some representations ...

A Metadata Approach to Resolving Semantic Conflicts (Working Paper, Alfred P. Sloan School of Management)

Semantic reconciliation is an important step in determining logical connectivity between a data source (database) and a data receiver (application). Semantic reconciliation is used to determine if the semantics of the data provided by the source is meaningful to the receiver. In this paper we describe a rule-based approach to semantic specification and demonstrate how this specification can b...

Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach

While the Internet has facilitated access to information sources, the task of scalable integration of these heterogeneous data sources remains a challenge. The adoption of the eXtensible Markup Language (XML) as the standard for data representation and exchange has led to an increasing number of XML data sources, both native and non-native. Recent integration work has mainly focused on developi...

Preprocessing and Integration of Data from Multiple Sources for Knowledge Discovery

The explosive growth in the generation and collection of data has generated an urgent need for a new generation of techniques and tools that can assist in transforming these data intelligently and automatically into useful knowledge. Knowledge discovery is an emerging multidisciplinary field that attempts to fulfill this need. Knowledge discovery is a large process that includes data selection,...

Journal:

PVLDB

Volume 2

Publication date: 2009